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ABSTRACT 

Despite recent advances in experimental approa- 
ches for identifying transcriptional c/s-regulatory 
modules (CRMs, 'enhancers'), direct empirical dis- 
covery of CRMs for all genes in all cell types and 
environmental conditions is likely to remain an 
elusive goal. Effective methods for computational 
CRM discovery are thus a critically needed comple- 
ment to empirical approaches. However, existing 
computational methods that search for clusters of 
putative binding sites are ineffective if the relevant 
TFs and/or their binding specificities are unknown. 
Here, we provide a significantly improved method 
for 'motif-blind' CRM discovery that does not de- 
pend on knowledge or accurate prediction of 
TF-binding motifs and is effective when limited know- 
ledge of functional CRMs is available to 'supervise' 
the search. We propose a new statistical method, 
based on 'Interpolated Markov Models', for motif- 
blind, genome-wide CRM discovery. It captures the 
statistical profile of variable length words in known 
CRMs of a regulatory network and finds candidate 
CRMs that match this profile. The method also uses 
orthologs of the known CRMs from closely related 
genomes. We perform in silico evaluation of pre- 
dicted CRMs by assessing whether their neighbor- 
ing genes are enriched for the expected expression 
patterns. This assessment uses a novel statistical 
test that extends the widely used Hypergeometric 
test of gene set enrichment to account for variability 
in intergenic lengths. We find that the new CRM pre- 
diction method is superior to existing methods. 



Finally, we experimentally validate 12 new CRM pre- 
dictions by examining their regulatory activity in vivo 
in Drosophila; 10 of the tested CRMs were found to 
be functional, while 6 of the top 7 predictions 
showed the expected activity patterns. We make 
our program available as downloadable source 
code, and as a plugin for a genome browser in- 
stalled on our servers. 

INTRODUCTION 

Temporal and spatial regulation of gene expression results 
from the interaction of transcription factors (TFs) with 
specific c/s-regulatory DNA sequences. These regulatory 
sequences are typically organized in a modular fashion 
with each module containing one or more binding sites for 
a specific combination of TFs (1). A 'tis- regulatory module' 
(CRM) can thus be defined as a collection of TF-binding 
sites that function jointly to regulate a discrete aspect of a 
gene's expression pattern. CRMs are generally described 
as being short (< 1000 bp) contiguous stretches of DNA, 
located few to hundreds of kilobases away from their as- 
sociated gene, and are found five-prime, three-prime and 
within introns of the gene. Davidson (2) estimates that 
there may be as many as 10-fold more CRMs than genes, 
yet the vast majority of CRMs have not been identified. 
The problem of CRM discovery is complicated by the fact 
that unlike protein-coding regions, which have recogniz- 
able sequence features such as open reading frames and 
codon-usage biases, no similar properties are known for 
CRMs; and unlike promoters, which by definition lie imme- 
diately five-prime to the gene and which contain a limited 
number of well-defined sequence motifs (3), CRMs are 
constrained neither in location nor by motif composition. 
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The traditional approach to CRM identification invo- 
lves a tedious process of testing many sequence fragments 
for regulatory activity in a reporter gene assay. Recently 
developed genomic techniques have led to relatively rapid 
screens for potential gene-regulatory regions; such tech- 
niques include chromatin immunoprecipitation coupled 
to genomic tiling arrays (ChlP-chip) (4) and ultra high- 
throughput sequencing (ChlP-Seq) (5). Despite their great 
promise, these empirical methods have limitations: it is 
currently infeasible to assay all tissue types under all con- 
ditions, and potential CRMs may be missed by these tech- 
niques. In recent years, computational methods have 
provided an attractive alternative for module identifica- 
tion. Computational methods (6-11) typically start with 
a small set of relevant TFs and their binding motifs, and 
search the genome for clusters of putative binding sites 
(motif matches). However, this approach is ineffective if 
the relevant TFs and/or their binding specificities are 
unknown. Motif databases for Drosophila currently cata- 
log only a fraction of the estimated number of TFs 
(12-14). Intense efforts are being made to characterize 
TF-binding specificities in mouse and fly (15,16), but 
these efforts are labor intensive and expensive, and the 
problem of sparse motif knowledge may thus persist for 
scientists studying organisms other than human, mouse 
and fruit fly. Moreover, for the majority of regulatory 
systems, some or most of the relevant TFs have not yet 
been identified. For these reasons, the application of 
motif-based computational methods has been limited to 
a few well-understood biological systems, where the neces- 
sary prior knowledge is available. Specialized computa- 
tional tools to discover motifs from the training data 
may alleviate the problem, but the modest success rate 
of motif finding programs (especially for metazoan gen- 
omes), as suggested by past surveys (17-19), casts doubts 
upon the prospect of CRM discovery based on computa- 
tional motif finding. 

This article examines CRM prediction in the common 
scenario where knowledge of the relevant TFs and/or their 
binding specificities (motifs) is missing. In particular, we 
ask the following question: suppose a small set of modules 
participating in a transcriptional network is known a 
priori. Can we use such information as 'training data' to 
guide the search for other modules in that network? We 
call this task the 'supervised CRM prediction problem'. 
The success of a CRM prediction may be defined as the 
ability of the predicted sequence to drive a discrete 
temporal and/or spatial expression pattern. An additional, 
stricter criterion for success is that the predicted CRM 
recapitulates the common expression pattern of the 
training CRMs. 

Here, we undertake supervised CRM discovery without 
assuming motif knowledge or the ability to accurately 
discover the relevant motifs ab initio. In a previous publi- 
cation (20), we proposed various statistics to capture the 
sequence similarity (due to shared TF-binding sites) be- 
tween a candidate CRM and the training set of modules. 
Our study included previously reported approaches to the 
problem (21,22) as well as some new statistical techniques. 
We demonstrated that these techniques could then be used 
effectively to discover CRMs in Drosophila and human 



genomes. Narlikar and colleagues (23) used a similar ap- 
proach to predict human heart enhancers. Here, we ad- 
vance the state-of-the-art in solving this problem by 
developing a new statistical approach that improves sig- 
nificantly upon previous methods. This new score is based 
on 'interpolated Markov models' (IMM) and the use of 
multi-species comparison. An IMM can be thought of as a 
combination of Markov chains of varying orders, and 
considers the frequencies of words of variable length in 
learning a generative probabilistic model from the training 
CRMs. We train an IMM on the training set of CRMs 
from D. melanogaster as well as their orthologs from other 
Drosophila species, and use the likelihood ratio of this 
model and a suitable null model as the score of a candi- 
date CRM. The new method is shown to be superior to 
existing techniques in terms of predictive accuracy, which 
is assessed using a new statistical test that extends the 
widely used Hypergeometric test to correct for an import- 
ant bias present in this test as applied to our setting. 

We experimentally validated 12 new CRM predictions by 
examining their regulatory activity in vivo, and found 10 of 
these to be functional. Moreover, we observed that the new 
IMM score can effectively discriminate CRMs whose 
regulatory activity matches that of the training set from 
sequences that either do not drive expression or drive an 
expression pattern different from that of the training set. 

The new CRM prediction method and evaluation strat- 
egy developed here are made publicly available at http:// 
veda.cs.uiuc.edu/scrm-2/. The method can also be applied 
on the Drosophila genome through an online genome 
browser interface (http://veda.cs.uiuc.edu/gs, plugin 
'Supervised CRM Discovery'). 

MATERIALS AND METHODS 

Data sets 

The CRM training sets were obtained from (24). The 
corresponding expression gene sets were taken from 
Supplementary Table S10 in (20) (see Methods in 
Supplementary Data for more details). Both training sets 
and expression gene sets are publically available at: http:// 
veda.cs.uiuc.edu/scrm-2/. 

IMM-based score 

We implemented the IMM introduced by (25) for finding 
microbial genes. Let S = {s\, s 2 ,- ■ ■, ^n} be a sequence of 
length N. For an «th-order IMM, the conditional prob- 
ability of generating St given its preceding context (of n 
characters), is: 

IMMnCy/ltfj-l, • • ■ ,Si-n) = ■ ■ ,Si-n)Pr(Si\Si-i,. . . ,Si-„) 

+(1 - X n (si_ u . . . ,s i _„))IMM n _i(Si\st-u . . . ,Si-„+i) 

(1) 

Here, Pr(j;|s;_i, . . . , S/_„) is proportional to the frequency 
of the word (sj-„, . . . ,Sj-i,Sj) in training data and 
. . . ,Si- n ) is a 'mixture weight' that depends on 
the number of occurrences of the word (.?,_„, . . . , in 
the training data [Xq — 1; see (25) for details]. Thus, 
IMM n (.v,|.v,_i, . . . ,.?,_„) is modeled as a mixture over n 
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different probability distributions on Sf, each of which is 
conditioned on a context of a different length (0, 1,. . ., n). 

We score every candidate CRM S using the log likeli- 
hood score defined as: 



score(S) = log 



p(S\lMM + 5 ) 
p(S\TMMj) 



(2) 



where IMM, and IMM</ are two 5th-order IMMs trained 
on positive and negative sequences, respectively. The 
training set was used as positive sequences (both strands 
were used for training the model while ignoring the stat- 
istical dependencies between the strands), while the nega- 
tive sequences used were regions of non-coding genomic 
sequence with G/C content similar to the native flanks of 
the CRMs. The training data set is publicly available at: 
http://veda.cs. uiuc .edu/scrm-2 / 

Multi-species scoring scheme 

Given a training set of CRMs from D. melanogaster, we 
obtained orthologous sequences of each CRM from 10 
other Drosophila species using the lift Over program from 
the UCSC Genome Browser database, and treated the 
resulting set of CRMs as the sequence data on which the 
model (IMM or HexMCD) was trained (see Methods in 
Supplementary Data). Fewer than 4% of the CRMs were 
missing an ortholog in one or more species, which elimin- 
ates any potential bias toward CRMs for which more 
species are alignable. 

Locus length-aware Hypergeometric test 

A significance test of overlap between two gene sets is trad- 
itionally performed with the Hypergeometric test. However, 
this test assumes that every gene is equally likely to be 
selected a priori. This assumption is not true for the 'pre- 
dicted gene set' in our setting, which is the set of genes 
nearest to predicted CRMs, since gene loci vary in length 
across the genome (26). We confirmed that such variation 
in locus lengths is true for our data sets (Supplementary 
Figure SI A). We designed the 'locus length-aware 
Hypergeometric test' (LLHT) to correct for this bias. 
Our key idea is to assign to each gene a prior probability 
of being sampled, in proportion to its locus length. Other 
than this modification, the LLHT calculations are identi- 
cal to the Hypergeometric test. We compared evaluation 
P-values obtained using LLHT with those based on the 
Hypergeometric test, and found the former to be notice- 
ably more conservative for most of the 33 data sets 
(Supplementary Figure SIB). We also undertook a 
simulation-based experiment to assess the specificity of 
the two tests, and found that the Hypergeometric test is 
highly susceptible to false positives when the gene set has 
atypical locus lengths, while the LLHT retains its high 
specificity in this scenario (Supplementary Table SI). 

Drosophila reporter constructs and transgenic analysis 

Genomic sequences were generated by PCR and con- 
firmed by sequencing; genome coordinates for pre- 
dicted and tested CRMs are provided in Supplementary 
Table S2. The putative CRM sequences were subcloned 



into plasmid pattBnucGFPs, a tf>C3 1 -enabled Drosophila 
transformation vector containing EGFP under the control 
of a minimal hsp70 promoter (details available on request). 
Transgenic flies were produced by Genetic Services Inc. 
(Cambridge, MA, USA) by injection into line attP2. 
Homozygous transgenic embryos were collected, fixed 
and stained with antibodies to GFP (Abeam, ab290) 
using standard methods. Muscle expression was visualized 
using anti-Tropomyosin (Babraham, MAC 141). DIC mi- 
croscopy was performed using a Zeiss Axioplan micro- 
scope with a Retiga-EXi camera (Qimaging) and 
Openlab software (Improvision). Fluorescent images 
were acquired using a Leica SP2 confocal microscope. 
All images were color corrected and contrast adjusted in 
Adobe Photoshop. 



RESULTS AND DISCUSSION 

The problem 

Suppose we are given a set T of CRMs (called the 'training 
set') that are known to drive similar expression patterns. 
For example, these could be CRMs that regulate genes 
with early mesodermal expression or genes that are invo- 
lved in wing development. The task is to search the genome 
for other CRMs that drive an expression pattern similar to 
those in the training set T. A successful method to perform 
this task will capture the m-regulatory signatures of TFs 
regulating T, and search genome wide for occurrences of 
those signatures. A predicted CRM can be validated by 
testing its function in a reporter gene assay and judging if 
its expression pattern is similar to that of the training set. 
We refer to the task just defined as the 'supervised CRM 
prediction' problem. Note that the problem is not com- 
pletely specified in this formulation, since we do not pre- 
cisely lay down what 'similarity of expression pattern' 
means. Instead, we allow the domain specialist to deter- 
mine which characteristics of an expression pattern are 
used to define T and to judge the success of a prediction. 
One simple way to define T (in Drosophila) is to search the 
REDfly CRM database (27) with specific expression 
terms, such as 'mesoderm' or 'wing', thereby retrieving 
CRMs that have been shown to drive expression in the 
mesoderm or wing of the fly, respectively. 

It is possible that a predicted CRM upon experimental 
testing shows regulatory activity in agreement with a 
neighboring gene's expression pattern, but this regulatory 
activity does not match that of the training set T. This is 
possible if TFs regulating the training set pleiotropically 
participate in a different regulatory network; the CRM 
prediction method may correctly learn the cw-regulatory 
signatures of such TFs and correctly identify a CRM that 
has those signatures, but the CRM may belong to the 
second network and hence show activity that does not 
match the training set. In cognizance of the above possi- 
bility, we define a second alternative criterion for a suc- 
cessful prediction: a CRM that drives an expression 
pattern matching that of a neighboring gene. In our ex- 
perimental evaluations, we will examine success rates for 
prediction using both criteria introduced here. 
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A new scoring scheme for supervised CRM prediction 

An obvious approach to supervised CRM prediction is to 
scan the genomic sequence with a shifting window of fixed 
size, and score the window for similarity to the training set 
of CRMs. The crucial aspect of this approach is the 
scoring system for matching a candidate module ('test 
CRM') to the training set. Different scoring systems 
have been proposed in the past, as reviewed and evaluated 
in our previous work (20). All of these scores have been 
based on comparing the frequencies of short words 
(henceforth called 'k-mers') between the test CRM and 
the training set. This is a reasonable strategy, since the 
regulatory function of CRMs is believed to be encoded 
in the types and numbers of transcription factor-binding 
sites. Specifically, binding sites that occur repeatedly in the 
training CRMs ought to cause some level of overrepre- 
sentation of specific k-mers, and the occurrence of such 
words in the test CRM should then indicate functional 
similarity. The kinds of scores that have been explored 
in this context include Markov chain-based scores (22), 
dot product-based scores (28), scores based on Poisson 
statistics (29) and others (30,31). For statistical conveni- 
ence, these scores have been based on frequencies of 
k-mers of a fixed length (chosen from the range 5 to 8). 
However, in reality, the recognition motifs of different 
TFs are of varying lengths and binding sites of the same 
TF are known to display some amount of variability in 
sequence. For these reasons, the use of fixed length k-mers 
in the statistical scores may be problematic. While longer 
k-mers might reflect the binding site pattern more accur- 
ately, they may not have the statistical support neces- 
sary for robust testing, due to the variability among 
sites. Moreover, a CRM comprises binding sites of differ- 
ent TFs, with varying lengths, further undermining the use 
of fixed length k-mers in capturing the overall binding 
site composition of the training set. One natural solution 
to this problem is to start building a model (of the train- 
ing set) with the shortest k-mers and attempt to include 
longer words whenever there are 'enough' occurrences of 
the word in the training data. This is the key idea in the 
statistical model we adopt here: a fcth order IMM (25), 
which is a mixture of Markov models of all orders up to 
order k. As in a fixed order Markov chain, the IMM 
computes the probability of generating a sequence by 
multiplying the probabilities of each character (nucleotide) 
given the characters before it. However, instead of condi- 
tioning on a fixed number (say, k) of previous characters, 
the probability is calculated separately by looking at 1, 
2,...k previous characters, and a linear combination of 
these probabilities is the multiplicative factor contributed 
by the current character (See Materials and methods 
section). 

We train two 5 th order IMMs, one on the training set 
('positive' model) and one on suitably chosen non-CRM 
or 'background' sequences ('negative' model), and score 
every candidate sequence by calculating the (log)-likeli- 
hood ratio between two models. A positive score means 
that the candidate sequence is more likely to have been 
generated by (i.e. is more similar to) the positive model. 
This score is henceforth called the IMM score. 



Exploiting multi-species sequence data 

Functional binding sites are known to be under evolution- 
ary constraints, a fact that can be used to guide the search 
for potential CRMs (32). As a simple implementation of 
this idea, we trained the IMM using training CRMs from 
D. melanogaster as well as their orthologs from 10 other 
Drosophila genomes. The log-likelihood ratio score 
derived from the model(s) trained in this manner is hence- 
forth called the multi-species IMM (msIMM) score. Note 
that this approach does not use alignment information 
except when determining the boundaries of the orthologs 
of a training CRM. That is, it does not rely upon align- 
ments of orthologous CRMs, and thus it is free from arti- 
facts of potential errors in such alignments. Evolutionary 
comparison in this manner also provides a natural way to 
'smooth' the counts of k-mers, as explained next. Binding 
sites are known to be variable, yet standard Markov 
chains as well as the IMM use counts of exact matches 
to a k-mer to quantify the importance of that k-mer in the 
generative model. Previous methods have attempted to 
address this issue by allowing for a small number (e.g. 1) 
of mismatches to the word in counting its occurrence — a 
strategy that is rather simplistic in its modeling of binding 
site variability. Examining many orthologs of a CRM 
provides a natural way to capture the variability of 
binding sites, to be used in training a model. 

Evaluation methodology 

Given a set of CRM predictions genome wide, the defini- 
tive test of their accuracy would be to determine experi- 
mentally if each CRM drives an expression pattern in the 
expected tissue and/or developmental stage. Since such 
tests are typically not performed en masse (due to resource 
limitations), we need an in silico evaluation strategy. In 
our previous work, we proposed the following strategy, 
which exploits existing databases of expression pattern 
annotation across thousands of genes in D. melanogaster 
(33) [BDGP (http://www.fruitfly.org)]. It is possible to 
obtain, for any given training set T, a corresponding set 
G E of genes with expression patterns similar to those of 
T. Successful CRM prediction (by both criteria described 
in 'Introduction' section) implies that CRMs predicted 
using the training set T regulate genes with similar expres- 
sion patterns, i.e. the predictions would be overrepre- 
sented in the control regions of genes in G E . In other 
words, if we consider the CRM predictions for the training 
set T, and denote the set of 'corresponding' genes (e.g. 
closest neighbors) as G T , we could use the statistical sig- 
nificance of the overlap between G T (the 'predicted gene 
set') and G E (the 'expression gene set') as an assessment of 
prediction accuracy (Figure 1 A). This approach to evaluat- 
ing CRM prediction has been used by several groups pre- 
viously (20,22,32,34) and is a natural option to adopt 
when the existing knowledge of functional CRMs in a 
regulatory network is sparse, yet gene expression data- 
bases are available (see Methods in Supplementary 
Data, Notes 1 and 2 for more details). We evaluated the 
msIMM score in this manner, using the LLHT for statis- 
tical assessment of enrichment. This is a generalization of 
the widely used Hypergeometric test, which we designed 
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Figure 1. (A) General scheme for evaluation of a CRM discovery method. We first select a set of genes from BDGP (or FlyBase) with expression 
patterns commensurate with those of the CRMs in a data set. We next take the modules predicted for that data set and extract their nearest 
neighboring genes ('predicted gene set"). (In this step, we ignore predicted modules that overlap with any training CRMs.) Finally, we perform a 
Hypergeometric test of enrichment between the expression gene set and the predicted gene set. (B) IMM of order n is a mixture of Markov models up 
to order n. 'Oth MM', '1th MM',. . ., '«th MM' denote Markov models of order 0, 1, .. ., n, respectively, ojq, . . . ,a>„ are the mixture weights. (C) Scorer 
scores every windows of length 500 bp with 50 bp shift across the entire genome. In this toy example, the score of each sliding window (orange lines) 
is shown as blue bar. 



specifically for assessing the significance of overlap 
between two gene sets (see 'Materials and Methods' 
section; Materials in the Supplementary Data). We used 
33 training sets and their corresponding expression gene 
sets, collectively called 'data sets', from (20), based on the 
REDfiy (35), FlyBase (33) and BDGP (36) databases (see 
'Materials and Methods' section). Figure 1 describes the 
complete pipeline of supervised CRM prediction by the 
IMM score and the evaluation strategy. 

Evaluation results 

The methods evaluated included the single and multi- 
species versions of the IMM score, the three best perform- 
ing methods from (20) — HexMCD [derived from (22)], 
PAC-rc [derived from (29)] and HexYMF-s200-rc, as well 
as a motif-based approach called Clover-ClusterBuster 
(20) — that uses 'Clover' (37) for motif selection and 
'ClusterBuster' (8) for scanning with selected motifs (see 
Methods in Supplementary Data for more details) — and a 
newer alignment-free method from (38) (henceforth called 
AJTO). We also implemented a multi-species version of 
HexMCD (called msHexMCD), as we had done for the 



IMM score. The comparison between IMM and HexMCD 
is of special interest as both are based on a Markov chain 
formulation, but HexMCD uses a fixed order (of 5), in 
contrast to the interpolated order approach of IMM. 
We obtained the 'predicted gene set' (G T , of size 200) 
computed a P-value (LLHT based) for each method — 
data set pair, henceforth called the 'evaluation P- value'. 
We then counted how many data sets (out of 33 in our 
test-bed) yielded an evaluation P-value better than a 
threshold, and varied this threshold. The results are 
shown in Figure 2A. 

• The msIMM score was found to be superior to all other 
methods by this criterion. For example, msIMM had an 
evaluation P-value better than E-5 on eight data sets, 
compared to the second best (non-IMM) method 
msHexMCD for which the corresponding number was 
five data sets. At E-10, the corresponding numbers were 
four data sets for msIMM, one for msHexMCD and 
IMM and zero for all other methods. 

• The msIMM score yields clearly better results than 
the single species IMM score, indicating the advan- 
tage of using multi-species data in the training phase. 
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This trend was also seen when comparing the multi- 
species version of HexMCD to the corresponding 
single-species score used in (20). 
• The single species IMM score was found to be superior 
to the single species HexMCD score and to the other 
single species scores (PAC-rc, HexYMF-s200-rc), re- 
vealing the advantage of using variable length words 
to model CRMs. 

The AJTO method led to no data sets with evaluation 
P-value < 0.001. Since we have relatively little experience 
as users of this software (38), we chose to exclude it from 
further comparisons, to avoid making any biased infer- 
ences about its accuracy. 

To provide a more detailed view of the above compari- 
sons, we plotted the overlap between the expression gene set 
and the top k predicted genes, as a function of k (Figure 2B; 
Supplementary Figure S2A and S2B), for one particular 
data set (complete results in Supplementary Figure S2A). 




0 50 100 150 200 

Number of predicted genes 



Figure 2. Evaluation of methods. For each method, shown is (A) the 
number of data sets for which the evaluation P-value is significant at 
different LLHT P-value thresholds. For clarity, all single species 
methods are formatted as dashed lines and the two multi-species 
methods are shown as solid line. (B) F-axis shows the number of 
overlaps between the expression gene set and the top k predicted 
genes for 'imaginal_disc.2' data set. (See Supplementary Figure S2A 
and S2B for all other data sets.). 



While most methods are comparable to each other in 
the high specificity range (e.g. <75 predictions), the 
msIMM method distinguishes itself from the others in 
the top 100-200 predictions. (Note that the set G T is of 
size 200 for each method, for fair comparison above.) We 
note, however, that this particular pattern of improved 
performance is not common to all data sets 
(Supplementary Figure S2A). 

We also compared the evaluation P-value of msIMM to 
the P-value of the best method reported in (20), for all 15 
data sets where this evaluation was performed in (20). 
These are listed in Table 1. (See Supplementary Table S3 
for evaluation P-value of all methods in all data sets.) In 
every data set, the msIMM P- value is better than or similar 
to that of the competing method, with the most dramatic 
improvements observed in the data sets 'cns.l' (E-5 to 
E-9), 'imaginal_disc.l' (E-6 to E-ll) and 'ventral_ 
ectoderm. 1' (E-4 to E-12). 

In the above presentation, a method's performance is 
assessed based on the overlap between the predicted gene 
set (200 genes with the strongest predicted CRMs in their 
control regions) and the expression gene set (see Methods 
in Supplementary Data, Note 1). We also examined the 
entire distribution of a method's score for each gene in the 
expression gene set and compared it to the score distribu- 
tion from a random collection of genomic segments (with 
lengths matching the gene control regions). These two dis- 
tributions were compared using a Wilcoxon rank sum test, 
and the resulting P-value indicates how well the particular 
score discriminates the expression gene set (which should 
have CRMs) from the randomly chosen segments (which 
should only have coincidental occurrences of CRMs). The 
P-values are shown for each data set and each method in 
Table 2. It is clear from this evaluation that the msIMM 



Table 1. Comparison between evaluation /"-values of msIMM and 
the best method from (20), for the 15 data sets that were 
reported on by (20) 



Data set Universe Expression Kantorovitz msIMM 





size 


set size 


et al. (2009) 




blastoderm. 1 


5506 


206 


2E-10 


5E-13 


cardiac_mesoderm. 1 


5506 


162 


3E-01 


5E-02 


cns.l 


5506 


839 


3E-05 


6E-09 


eye.l 


13 155 


73 
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The second and third columns show the total number of genes in the 
expression data source and the size of expression gene set for each data 
set, respectively (the size of the predicted gene set is 200 for all data 
sets). The lowest (best) /"-value for each pair is shaded. Note that the 
P-values reported here are LLHT P-values, which are different from 
the standard Hypergeometric P-values shown in Table 2 of (20). 
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Table 2. Discrimination between control regions of an expression gene set and random sequences of matching lengths 
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The second and third columns show the number of training CRMs and the size of expression gene set, respectively, for each data set. Scores of genes 
in an expression gene set were compared to scores of a collection of randomly chosen genomic regions. The score of a sequence is the maximum score 
in that region, under a CRM prediction scheme. For each gene in the expression gene set, 50 random genomic segments of length equal to the gene's 
territory length were included in the random collection. The last seven columns show the P-values (Wilcoxon rank sum test) of such a comparison for 
each data set and for each method. The best /'-value for each data set is shaded. The last row indicates the number of times that a method is superior 
(smallest P-value). 



score is best able to discriminate between the expression 
gene set and the random set, for the most number of data 
sets. 

Experimental validation 

We chose 12 putative CRMs predicted using the 'meso- 
derm' and 'somatic muscle' training sets (six for each), and 
tested them for regulatory activity in vivo. These two data 
sets were chosen partly because their evaluation ^-values 
were significant but not among the best. (This is especially 
true for the 'somatic muscle' set; evaluation P = 0.03.) 
Unlike (20), where the tested CRMs were from the data 
set with the strongest evaluation /"-value ('blastoderm', 
P = 5E-13), our goal here was to investigate the effective- 
ness of our methods in what seem to be more challenging 
data sets. We also note that these two data sets are not 
exclusive from other data sets (e.g. 'visceral mesoderm') in 
terms of their defining expression patterns, which creates 
additional difficulties for supervised CRM discovery. 



The 12 candidate CRMs were selected by the following 
criteria: 

(1) they are located near genes in the respective expres- 
sion gene set; 

(2) they collectively fall across the spectrum of msIMM 
scores, allowing us to examine (through direct experi- 
mental assays) the accuracy of this new scoring sc- 
heme not just at its highest scoring predictions but 
also at the medium and low scoring predictions; and 

(3) each candidate scores well by either msIMM or PAC- 
rc, i.e. a candidate with a relatively low msIMM score 
(<0) has some supporting evidence in favor of it 
being a potential CRM. 

Figure 3 summarizes the results of our experiments. Of 
the 12 tested sequences, 10 (83%) were found to drive re- 
porter gene expression in a tissue- or stage-specific pattern 
(see Methods in Supplementary Data, Note 3). The two 
failed predictions (corresponding to genes tkv and tsh) were 
the two lowest scoring predictions of msIMM (in this set). 
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Figure 3. In vivo validation of 12 Drosopliila CRM predictions. Transgenic embryos were stained with antibodies to GFP to detect reporter gene 
expression. Embryos are oriented anterior to the left. Panels A, F, G, K, M, O and S are dorsal views; H and N are ventral views; the remainder are 
lateral views with dorsal to the top. 'Pattern-specific' CRMs are shown in panels A-L and non-pattern specific CRMs in panels M-T. (A) The 
Antp_161 CRM drives expression in several tissues, including the visceral mesoderm (arrowhead) and the non-mesodermal midgut (asterisk). The 
endogenous Antp gene is expressed in a much more restricted manner, suggesting that the reporter gene expression is either ectopic or that the CRM 
is associated with a different gene of undetermined identity. (B) Co-labeling for Tropomyosin (magenta) shows GFP expression (green) in somatic 
muscles, in particular the lateral transverse muscle fibers (arrowheads). (C and D) The how_171 CRM drives expression in the mesoderm in both 
mid-stage (C) and late-stage (D) embryos, consistent with how gene expression. (E-H) The noc_532 CRM is active in many Aioc-positive tissues 
throughout embryogenesis. Pictured is metameric expression in both ectoderm and mesoderm at stage 9 (E), in the visceral mesoderm (F, arrow- 
head), in the mesodermally-derived lymph glands (G, arrowhead), the ventral nerve cord (H, arrow) and the hindgut (H, arrowhead). (I) Mhc_537 
regulates reporter gene expression (green) in a subset of M/;c-positive mesodermal cells including longitudinal visceral muscles (arrow) and several 
ventral oblique somatic muscle fibers (arrowheads). (J) Longitudinal visceral muscle precursors express GFP under the control of the slou_847 CRM 
(arrowheads), cells not positive for endogenous slou expression. Expression is also observed in the supraesophogeal ganglion (K) and ventral nerve 
cord (data not shown), consistent with known slou expression patterns. Inset shows a more dorsal view of the anterior portion of the embryo in the 
main panel. (L) mbc_64 drives expression in the mftopositive mesoderm, pictured here at stage 16. (M and N). Expression driven by the lola_648 
CRM is confined to the central nervous system in both mid-stage (M) and late-stage (N) embryos, consistent with tola expression. lola_648 overlaps 
the independently discovered CRM40 of (40). (O) The slou-828 CRM regulates reporter gene expression in tissues that are not s/cw-positive including 
cells in the antenno-maxillary complex (black arrow), the posterior spiracles (arrowhead) and cells in the anterior and posterior portions of the 
foregut (white arrow and data not shown). (P) .?/o«_<52S-controlled expression is also observed in the midgut, consistent with normal slou expression. 
(Q and R) The rib_120 CRM drives expression throughout hindgut development, part of the normal rib expression pattern. (S and T) Reporter gene 
expression regulated by the ama_299 CRM is restricted to the central nervous system. Although ama is not expressed in the CNS of late-stage 
embryos (41), earlier expression in the ventral midline beginning at stage 8 (data not shown) is consistent with ama expression at that stage. 



(continued) 
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We then examined the two specific success criteria we laid 
down in the beginning ('Introduction' section). Due to the 
non-exclusive and overlapping nature of the expression 
patterns found within each of the two training sets, we 
considered any mesodermal gene expression as matching 
the training pattern. Of the 12 tested sequences (50%), 6 
showed reporter gene expression in one or more mesoder- 
mal tissues (Figure 3A-L). Remarkably, the msIMM 
score was able to predict success by this criterion with 
90% accuracy. In particular, a threshold value of the 
msIMM score was able to separate the pattern-specific 
CRMs from the non-specific CRMs (or non-CRMs) 
with only one false positive error (Figure 3U). Of the 12 
predicted CRMs, 10 (83%) drive an expression pattern 
matching at least in part that of the endogenous expres- 
sion of the nearest gene. 

Visualization tool 

In addition to making the source code directly available 
for download, we have developed a plugin for Gbrowse 
(39), called 'Supervised CRM discovery', through which 
the user can select a set of related CRMs as their starting 
point and browse a genomic region of interest for other 
putative CRMs with similar functionality. The plugin can 
be accessed via: http://veda.cs.uiuc.edu/gs. 



CONCLUSION 

We have presented a new statistical method for genome- 
wide prediction of CRMs, beginning with a collection of 
known modules in the regulatory network of interest. This 
problem of 'supervised CRM prediction' was visited in 
our previous work (20), where we established the neces- 
sary benchmarks and evaluation methodology, and also 
compiled a suite of pre-existing and novel methods to 
solve the problem. The new method we propose now is 
shown to be superior to the best of the methods con- 
sidered in (20), hence advancing the state of the art on 
this important topic. 

Our work makes three additional regulatory 
networks — central nervous system, imaginal disc and 
ventral ectoderm — amenable to supervised CRM predic- 
tion. Our experiments also contribute 10 new CRMs, 
involved in development of early mesoderm and of 
somatic muscle, to the literature. 

Our new method, along with our previously compiled 
techniques (20), represents a flexible, accurate way of iden- 
tifying regulatory sequences, especially for less fully 
described processes for which we lack the knowledge of 
relevant transcription factors. It can prove to be an im- 
portant in silico adjunct to and extension of empirical 



CRM discovery approaches. Furthermore, it has the po- 
tential to lead to novel CRM predictions in experimentally 
less well-characterized arthropods such as mosquito, 
wasp, beetle and honeybee, while exploiting existing col- 
lections of CRMs in Drosophila. 

We have also made a key improvement to the evalu- 
ation methodology for genome-wide CRM prediction 
methods, by proposing the locus LLHT. This new test is 
shown to lead to superior specificity (lower false positive 
rate) when enrichment tests are performed with gene sets 
of unusually large locus lengths. The Hypergeometric test 
of significance of overlap between gene sets forms the basis 
of many tools used in genomics, and the common phe- 
nomenon of locus length variability therefore casts 
doubts on the statistical inferences from these tools (26). 
Our modification to the Hypergeometric test (LLHT) thus 
has the potential of widespread usability in contexts 
beyond CRM discovery. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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